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Abstract 

We study some of the most commonly used mutual information estima- 
tors, based on histograms of fixed or adaptive bin size, fc-nearest neighbors 
and kernels, and focus on optimal selection of their free parameters. We ex- 
amine the consistency of the estimators (convergence to a stable value with 
the increase of time series length) and the degree of deviation among the es- 
timators. The optimization of parameters is assessed by quantifying the de- 
viation of the estimated mutual information from its true or asymptotic value 
as a function of the free parameter. Moreover, some common-used criteria 
for parameter selection are evaluated for each estimator. The comparative 
study is based on Monte Carlo simulations on time series from several lin- 
ear and nonlinear systems of different lengths and noise levels. The results 
show that the fc-nearest neighbor is the most stable and less affected by the 
method-specific parameter. A data adaptive criterion for optimal binning is 
suggested for linear systems but it is found to be rather conservative for non- 
linear systems. It turns out that the binning and kernel estimators give the 
least deviation in identifying the lag of the first minimum of mutual informa- 
tion from nonlinear systems, and are stable in the presence of noise. 
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1 Introduction 



Mutual information (MI) is a nonlinear measure used in many aspects of time series 
analysis, best known as a criterion to se lect the appropriate delay for state space 



regim es of nonlinear systems I Hively et aL ._ 2000 : Naa et all 



20Q7H and to detect phase synchronization ISchmid et all 12004; 



2002 




Wick 


:s et al. 


Kreuz et al. 


, 2007] 



Besides nonlinear dynamics, it is used in various statistical settings, mainly as a 
distance or correlation measure in d ata mining, e.g . in independent compon ent 
analysis and feature-based clustering jTourassi et alll200ll ; |Priness et all 1200711 . 

Any estimate of MI, either between two va riables or as a function of delay for 
time series, is (a l most always) positively biased IITreves & PanzeriL 1 19951 : iModdemeijerj . 



1989; IPaninskiL 12003k iMicheas & ZografosL 2006]. For numerical- valued vari- 



ables, MI increases with finer partition depending on the underlying distribution 
and the sample size. Beyond the classical domain partitioning, other schemes 
have been used to estimate the densities inherent in the measure of mutual in- 



formation, e.g. kernels, B -splines and /c-nearest neighbors 


IMoon et al. 


1995; 


Diks & Manzan, 


2002 


Daub et al. 


2004 


Kraskov et al. 


2004 


■ 



There are generally few analytic results on MI. Expressions of MI in terms of 
the correlation coefficient are ob t ained for some known distributi ons, e.g. Gaussian 
and Gamma distribution I Pardol 1995 : Hutter & Zaffalon . 2005 1. Some statistical 
results on the mea n, variance and bias of the MI estimator using fixed partitioning 
can be found in llRoulstonl 1 19971 : Kbarbanel et"all bOOlh . but the distribution of 
any MI estimator is not known in general. For chaotic systems in particular, the 
discontinuity of the density function of their variables does not allow for an ana- 
lytic derivation of the statistics of MI estimators. Therefore, comparisons of MI 
estimators relies on simulation studies. In some studies the estimators are tested 



in identifying corr ectly the lag of the first minimum of MI [Mo on et all 11995 



Cellucci et all 1200511 . Another performance criterion is the b ias of estimators in 



the case of Gaussian pro cesses, where the true MI is known llCellucci et all 2005; 
iTrappenberg et al. . 2006ll . 

MI estimation involves one and two dimensional density estimation. Density 
estimation has been studied extensively and diffe rent methods have been suggested 
and compared in the stat istical literature, e.g. see IScottLI 19791 : Freedman & Diaconis, 
198ll : ISilvermanL [l986], but it is still to be investigated whether these methods 



and the suggested criteria for the selection of method specific parameters are also 
suitable for MI estimation. This is the main objective of this study. MI estima- 
tors have also been compared to other linear or nonlinear correlation measures 

], but we do not pursue this here 



[Palus, 1995; Steueretal. 



as direct comparison is not possible due to the different scaling of the measures, 
even after normalization. There are some comparative studies on the MI estima- 
tors and the selection of their parameters , as well as on t heir performance on both 
linear and nonlinear dynamical systems BWand & Jonesl 1 19931 : ISteuer et all 120021 : 
Nicolaou & Nasutol 2005 : Khan et al. . 2007 1. Here, we extend these studies, we 
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evaluate some of the most commonly used MI estimators, examine their consis- 
tency and optimize the selection of their parameters including criteria for parameter 
selection suggested in the literature. As the estimation depends on the underlying 
time series we use Monte-Carlo simulations on systems of white noise of different 
distributions, stochastic linear systems and dynamical non-linear systems (maps 
and flows of varying complexity). The performance of each estimator is examined 
with respect to the time series length, the distribution of noise and the noise level 
in the systems. 

We study here the performance of three types of estimators, i.e. estimators 
based on histograms (with fixed or adaptive bin size), /c-nearest neighbors and 
kernels. All estimators vary in the estimation of the densities at local regions and 
we investigate the optimal parameter for the determination of the two-dimensional 
partitioning. Based on the simulation results, we propose optimal parameters for 
each MI estimator with regard to the complexity of the system, the observational 
noise level and the time series length. 

The structure of the paper is as follows. In Sec. El we briefly discuss the estima- 
tors considered in this study and in Sec. [3] we present the evaluation procedure and 
the simulated systems. In Sec.|4l we give quantitative results on the dependence of 
the estimators on the parameters, time series length and noise, we propose optimal 
parameter selection and compare the different MI estimators. Finally, in Sec. [5] we 
discuss the results and draw conclusions. 



2 Mutual Information Estimators 

MI is a measure of mutual dependence between two random variables and quanti- 
fies the amount of uncertainty about one variable reduced when knowing the other. 
The MI of two continuous random variables X, Y has the form 

I(X,Y)= f [ f xx (x, y) log a fx ' Y ^ f dxdy, (1) 

where Jx.y(x, y) is the joint probability density function (pdf) of X and Y, whereas 
fx(x) and fy(y) are the marginal pdfs of X and Y, respectively. The units of in- 
formation of I(X, Y) depend on the base a of the logarithm, e.g. bits for the base 
2 logarithm and nats for the natural logarithm. 

Assuming a partition of the domain of X and Y , the double integral in Eq.(Q} 
becomes a sum over the cells of the two-dimensional partition 

X(X, Y) = Y^Pxxihj) log a PX ; Y S % ' 3 ?. V (2) 
~ PXWPYKJ) 

where px(i), PyU), and px,y{i,j) are the marginal and joint probability mass 
functions over the elements of the one and two-dimensional partition. In the limit 
of fine partitioning the expression in Eq.© converges to Eq.{T]). This may partly 
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justify the abuse of notation of MI for the continuous and the discretized vari- 
ables. It is always I(X, Y) > 0, with equality holding for independent vari- 
ables, and Z(X, Y) < H(X) < log a n (Jensen inequality), where H{X) = 
Y17=iP( x i) l°SaP( x «) i s me Shannon entropy of X. For a time series {X t }™ =1 , 
sampled at fixed times r s , MI is defined as a function of the delay r assuming the 
two variables X = X t and Y = X t - T , i.e. 1(t) = T{X t , X t - T ). 



T he Shannon entropy is alway s misestimated due to finite sample effects IIGrassbergeii. 



19881 : iKantz & ShurmannL [l996], but we do not discuss MI in terms of entropies 



here as the estimation of MI boils down to the estimation of the densities in Eq.(Q} 
or probabilities in Eq.©. The estimators of MI, denoted I(t), differ in the estima- 
tion of the marginal and joint pr obabilities or densities , using binning |Fraser & Swinneyl 
19861 : barbellay & VaidaL ll999T|, kernels JsilvermanL 1 19861 : hvloon et all Il995h or 



correlation integrals | Diks & Manzanll2002ll. fc-nea rest neighbors llPaninskil 120031 : 



2004] or the Gram-Charlier polyno- 



Kraskov et all I2004J1. ^-splines ibaub et all , - . , 

mial expansion jBlinnikov & Moessner , 1998ll . All these estimators depend on at 
least one parameter. We present bellow the three first estimators that are most 
widely used. 



2.1 Binning estimators 

The most common MI estimator is the naive equidistant binning estimator (ED) 
that regards the partition of the domain of each variable into a finite number b 
of discrete bins (equidistant partitioning). The probability at each cell or bin is 
estimated by the corresponding relative frequency of occurrence of the samples 
in the cell or bin. The number of bins for each variable is the same, so that the 
parameter to be optimized is the number of bins for the partition or equivalently 
the bin width. The computation of this MI estimator is straightforward as it is 
directly estimated from the one and two dimensional histograms. 

A second binning estimator is the equiprobable binning estimator (EP), which 
is derived by partitioning the domain of each variable in b bins of the same occu- 
pancy (equiprobable partitioning) but different width. The equiprobable partition- 
ing actually transforms the sample univariate distribution to discrete uniform with 
b c omponents minimiz i ng the effect of the univariate distribution on the estimation. 

Fraser & Swinnev [1986] suggested an estimator using an adaptive partition- 



ing. This method constructs a locally adaptive partition of the two-dimensional 
plane. It starts with a partition of equiprobable bins for each variable and makes 
finer partition in areas where the joint probability density is non-uniform until the 
joint distribution on the cells is approximately uniform. The final partition is finer 
in dense regio ns whereas l ess occupied regions are covered with larger cells. It 
was found in llPalusl 1 19931 : ICellucci et all 1200511 that this complex algorithm does 
not substantially improve the binning estimator and requires large data sets to gain 
accuracy; therefore it is not included in the current evaluation. 

A different estima t or ma king use of adaptive partitioning (AD) is proposed 
by iDarbellay & Vajdal lll999tl . The partition consists of rectangles specified by 
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marginal empirical quantiles, which are not uniform in the sense that they are not 
made of a grid of vertical and horizontal lines, irrespectively of whether these lines 
are equally spaced or not. The AD estimator builds such a partition in a way 
that it achieves conditional independence on the rectangles of the partition. The 
advantage of this estimator is that it is data-adaptive and does not a priori determine 
the number of bins in the partition. The AD estimator has a direct dependence on 
n, which determines the roughness of the partitioning in a somehow automatic 
way. In the abundance of data, the AD estimator reaches a very fine partition that 
satisfies the independence condition in each cell, so that the total number of cells is 
very large and analogous to a fixed-partition with a respectively large b. Note that 
the dependence of AD on n is not comparable to that of the fixed-bin estimators 
because it involves a change of partitioning with n. 

For any binning scheme, the MI estimator J(r) is given by Eq.© where the 
variables are (X, Y) = (X t , X t - T ), the sum is referred to the partition of the two- 
dimensional domain of (X t ,X t - T ) and px t , Px t ^ T , Px t ,X t - T are the marginal and 
joint probability distributions defined for each cell of the partition. 



2.2 fc-nearest neighbor estimator 

Kras kov et al. [2004] proposed an MI estimator (KNN) that uses the distances of 



A;-nearest neighbors to estimate the joint and marginal densities. For each refer- 
ence point from the bivariate sample, a distance length is determined so that the 
k nearest neighbors are within this distance length. Then the number of points 
within this distance from the reference point gives the estimate of the joint density 
at this point and the respective neighbors in one-dimension give the estimate of 
the marginal density for each variable. The algorithm uses discs (or squares de- 
pending on the metric) of a size adapted locally and then uses the corresponding 
size in the marginal subspaces, so in some sense the estimator is data adaptive. 
Still, it involves as a free parameter the number of neighbors k. Note that a large 
k regards a small b of the fixed binning estimators. However, the estimator does 
not use a fixed neighborhood size and therefore there is not a clear association of 
k and b. The KNN estimator is data efficien t, adaptive, has minim al bias and is 
recommended for high-dimensional data sets IIKraskov et all 120041. It requires an 
additional computational cost for the search of the k neighbors. 



2.3 Kernel density estimator 

The kernel density MI estimator (KE) uses a smooth estimate of the unknown prob- 
ability density by centering kern el functions at the data samples; ker nels are used 
to obtain the weighted distances I Silvermanlll986l : lMoon et al , 1995 1. The kernels 
essentially weigh the distance of each point in the sample to the reference point de- 
pending on the form of the kernel function and according to a given bandwidth h, 
so that a small h produces details in the density estimate but may loose in accuracy 
depending on the data size. 



5 



KE estimator has two free parameters, the bandwidth h\ for the marginal den- 
sities of X and Y, and the bandwidth h% for the joint density of (X, Y). The 
bandwidth h\ is related to b of the fixed binning estimators by an inverse relation, 
e.g. a rectangular kernel assigns a bin centered at the reference point. Its advan- 
tage over binning estimators is that the location of the bins is not fixed. Among the 
different kernel functions, Gaussian kernels are most commonly used and we use 
them here as well. A kernel density estimator with Gaussian kernel function and a 
fixed bandwidth h at a point x £ R d is 

where S is the data covariance matrix and n' is the number of the d-dimensional 



vectors [Moon et al. 19951. 



3 Simulation Setup 

The evaluation of the estimators is assessed by Monte-Carlo simulations on white 
noise, linear systems and chaotic systems of different complexity, listed in Table Q] 

TABLE 1 *** To be placed here *** 

A Gaussian and a skewed Gamma distribution are used to generate white noise time 
series, whereas for linear systems, the autoregressive model AR(1) and autoregres- 
sive moving average ARM A( 1,1) are used with coefficients as given in Table [H 
assuming both Gaussian an d Gamma inp ut white noise. The non-linear chaotic 
systems are the Henon map jHenonL[l97fll 



x t = 1 - axf_ 1 + bx t -2, 



the Ikeda map Uke da et al.Lll98CJl 



x t = a + b(x t -i cos ut-x-yt-i sin ut-x] 
y t = b(x t -i sinn t _i + y t -\ cosu t -i), 



where u t = k— \ a , and the Mackey-Glass differential system IMackev & Glassl 



1977H 



dx 0.2x t ^ A n . 

— — = tt; O.lXf 

dt l + xf_ A 

where the delay A accounts for the system complexity. We use here A = 17 and 
A = 30 for low-dimensional chaos of fractal dimension about 2 and 3, respectively, 
and A = 100 for high-dimensional chaos of fractal dimension about 7. Observa- 
tional white noise at different levels is also assumed for the chaotic systems, given 
as a percentage of the standard deviation of the noise-free data. 
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Different lengths n for the generated time series from each system are con- 
sidered as follows. For white noise and linear systems, n is given in powers of 2 
from 5 to 13 and for nonlinear systems from 8 to 13. I(t) is computed using all 
methods on 1000 realizations for each system, noise type or level, and time series 
length. As all linear systems are of order 1, MI is computed only for lag 1. For the 
nonlinear systems, I(t) is computed up to the lag r for which /(r) levels-off. For 
the Mackey-Glass system we compute the lag of the first minimum of I(t) and 
specifically for A = 100 the lag that MI levels-off because it does not exhibit a 
distinct minimum. For each estimator, I{t) is computed for a wide range of values 
of the free parameter and for specific values determined by standard criteria, which 
are specified below. 

For the binning estimators ED and EP we set the number of bins to b = 
2, 4, 8, 16, 32, 64. We also consider 10 commonly used criteria for b, given in Ta- 
bled 



TABLE 2 *** To be placed here 



For the choice of k of A;-nearest neighbor estimator KNN, Krask ov et al.l f 20041 



jropo se to use k = 2 to 4 (these are also used in IlKreuz et all 120071 : iKhan et al. 
2007]). However for real world data one should investigate also larger values 
of k. Therefore we use in the simulations a wide range of k values as for b, 
k = 2,4,8,16,32,64. 

Among different kernel functions used in the literature for density estimation, 
and for MI estimation in particular, the common pract ice is to use the Gaussian 
kernel in conjunction with the "Gaussian" bandwidth of lSilvermanl 1198611 



(d + 2)ri 



l/(d+4) 



(4) 



(n! is the number of the d-dimensional vec tors) or multiples of it [H arrold et al. 



2001; 


Steuer et al., 


2002; 


Khan et al. , 


20071 



tion with kernels, the range of bandwidths is usually not searched and a bandwidth 



is selected according to a criterion such as the "Gaussian" bandwidth iMoon et al. 



19951 : ISteuer et all 1200211 . A multiple bandwidth selec tion scheme for the test for 
independence is proposed in IIDiks & Panchenkol . 1200811 . Analytic and simulation 
studies ha ve shown that the choice of the bandwidth i s cruci al and depends on the 



data size I Bonnlander & Weigend , 1994 : Jones et al. , 1996ll . Therefore we con 



sider a wide range for the bandwidth h\ for one dimension and hi for two dimen- 
sions, as for b and k. Specifically, for h\ we take 15 values in [0.01, 2] at a fixed 
base-2 logarithmic step and set hi = h\ and hi = s/2~h\. The second form for hi 
accounts for the scaling of the Euclidean metric in 3ft 2 , which we use in the simu- 
lations. We also consider some well-known criteria for the choice of bandwidths, 
given in Table [3] 
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TABLE 3 *** To be placed here *** 



The three first criteria define bandwidth for both one and two-dimensional space. 
For the other criteria we set J12 equal either to h\ or y/2h\. 

The true (theoretical) MI X(r) is not known in general. However, for Gaussian 
processes Z(r) is given in terms of the autocorrelation function p(r) as 

J(r) = -0.51og(l-p 2 (r)). (5) 

In the lack of the true MI for the other systems, we assume consistency of the es- 
timators and use the asymptotic value of J(r) computed on a realization of size 
n = 10 7 . In this computation, we set for the ED and EP estimators b = 64 (we 
also computed MI for b up to 256, however MI did not substantially differed), 
for KNN estimator k = 2, and for the KE estimator h\ = 0.01 and hi = h-\ 



Similar approach fo r approximating the true MI is used in BHarrold et all 12001 



Cellucci et all. 1200511 . We denote 1^ the true or asymptotic MI and use it as refer- 
ence to compute the accuracy of the different estimators. 

We evaluate the estimators and their parameters separately for white noise and 
linear systems, for the nonlinear maps Henon and Ikeda, and for the Mackey-Glass 
system. First, we investigate the dependence of the estimators on their free param- 
eter for white noise and linear systems and for different time series lengths giving 
a total of L = 126 cases (14 systems and 9 time series lengths). For each estima- 
tor, we compute the mean estimated MI I c (l) from the 1000 realizations for each 
tested value of the free parameter denoted by c, where 1 denotes the system case and 
/ = 1, L. Further, for each I we compute the deviation dl c (l) = \I c (l) — -Too(OI> 
where Ioo(0 is either the true MI (for Gaussian processes) or the asymptotic value 
computed from each estimator. All estimators converged to the same MI under 
proper parameters as obtained for n increasing up to 10 7 . Given the asymptotic MI 
Ioo{l), the optimal parameter values for each case I (system and time series length) 
is obtained from the minimum deviation dl c (l) with respect to c. The estimators 
are then compared for their optimal parameters by computing the divergence of the 
mean I{r) of each estimator from 7oo(0 for all cases. 

For the discrete nonlinear systems, there are L = 48 cases (8 maps including 
different noise levels and 6 time series lengths). Here, a single 1^ for each system 
cannot be obtained as the MI estimate for n increasing up to 10 7 does not converge 
and the MI for n = 10 7 is still dependent on parameter selection and varies also 
across estimators. Therefore, for each system we set as asymptotic value 1^ of the 
estimator the MI computed for n = 10 7 and for a very fine partition. So here, the 
interest is in the dependence of each estimator on the free parameter and the rate 
of convergence towards the asymptotic value. 

For the nonlinear flows derived from the Mackey-Glass system for A = 17, 30 
(noise-free and with noise), we concentrate on the first minimum of MI and com- 
pute the lag to of the first minimum of MI for each of the 1000 realizations of each 
system. For A = 100 there is no clear minimum of MI and it follows a rather 
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exponential decay. Therefore we compute instead the lag tq for which MI levels 
off according to a criterion for levelling. In order to compare the estimators, we 
examine the consistency of each estimator with n, the dependence of the estima- 
tion of ro on the parameter selection and the variance of the estimated lags to from 
all cases. 

4 Results 

4.1 Results on white noise and linear systems 

The MI for lag one 1(1) from the binning estimators ED and EP increases always 
with the number of bins b. Thus for white noise where ioo(l) = the best choice 
for 6 is 2 that gives the smallest positive 1(1). For the linear processes, 1(1) de- 
creases with the time series length n for each b, as shown in Fig. Q] for the ED and 
EP estimates of 1(1) from an AR(1) process. 



Figure 1 *** To be placed here *** 



We note that for sufficiently large b, 1(1) converges with n to the true value 
1(1) = /oo(l) given in ([5]>. On the other hand, for small b, 1(1) underestimates 
/oo(l) depending again on n. Thus the optimal b that gives the smallest dl c (l) and 
estimates best 1^(1) depends on n. 

We have observed that b depends also on the autocorrelation function r(r) of 
the linear system. To investigate further this dependence we computed ED and EP 
estimates of 1(1) for a wide range of r(l) values of an AR(1) process. We found 
that the optimal b (found by the smallest dl c (l)) increases smoothly with n and 
r(l), as shown in Fig. [2] 



Figure 2 *** To be placed here *** 

A search for a parametric fit of optimal b regarding the graph of Fig. [2}) resulted in 
the form 

b = Q /e w2 (6) 

where the coefficients a, (3, 7 take similar values for the ED and EP estimators 
(0.65,0.25,2.11 and 0.76,0.19,1.91 respectively). 

Most of the criteria in Table [2] tend to overestimate b. To evaluate the perfor- 
mance of the 10 criteria in Tableland the proposed criterion in Eq.©, we compute 
the total score S c of each criterion c, where c = 1, . . . , 11, for all L = 126 tested 
systems and time series lengths, as 

s = Ef = i(^(0-/oo(0) 2 

C EtliU) - Ioo(l)) 2 
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where for each case I, I c (l) = tt Ylc=l ^c(0 is the grand mean of the means 
from all criteria. According to the score S c , the proposed criterion in Eq.© for the 
optimal b, denoted Hll, outperforms the other criteria when ED estimator is used, 
as shown in Tabled 

Table 4 *** To be placed here *** 

For the EP estimator, criterion H9 scores lowest and Hll is ranked fifth but the 
differences in the scores of the best five criteria are comparatively small. 

For certain bivariate distributions and Gaussian processes, it was found that 
the AD estimator was precise in estimating M I and converged fast to the true MI 
I Kraskov et al. . 2004 : Trappenberg et al. . 2006ll . We confirmed this result by our 



simulations on the white noise and linear systems with the remark that the con- 
vergence to loo is rather slow and is succeeded at large n, as shown in Fig. [3] 

Figure 3 *** To be placed here *** 



The number of nearest neighbors k in the KNN estimator determines the rough- 
ness of approximation of the density functions in Eq.£l]), which corresponds to the 
roughness of the partitioning in Eq.©. The simulations showed that for white 
noise the MI estimated by KNN is close to zero for a long range of k and the devi 



ation from zero decreases as k approaches n/2 (as reported also in BKraskov et al. 



2004]). For the linear systems the optimal k is rather small. The dependence of the 



KNN estimator on k holds mainly for small time series as for large n the estimated 
MI converges to 1^ for any k, as shown in Fig. [4^. 

Figure 4 *** To be placed here *** 

Still, the convergence is slower for larger k. In any case, a highly accurate MI is at- 
tained with small k for all but very small time series. For example, for an accuracy 
threshold of 10~ 4 in estimating Iqo, i.e. — 2oo| < 10~ 4 , the optimal choice for 
A; is 2 in almost all cases except for very small n and r (1), as shown in Fig.|4j). Note 
that even for white noise time series of small length, k < 8 reaches this accuracy 
threshold (the peak in the graph of Fig.[4t> is for n = 2 5 and r(l) = 0). 

For the two dimensional bandwidth /12 we have considered /i2 = hi and 
/12 = \/2hi and studied the dependence of the estimated MI on h\ and /12 across 
a large range of bandwidths for white noise and linear systems. As for k of the 
KNN estimator, MI converges with n and faster for smaller h\. For /12 = h\ the 
convergence with n is correctly towards 1^ (see Fig. UK), but for /12 = y/2h\ MI 
decreases with h\ and becomes negative (see Fig.[5t>). 

Figure 5 *** To be placed here *** 
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This result advocates the use of the same bandwidth for the kernel estimates 
of the marginal and joint distributions. We have also investigated whether there 
is dependence of hi on r(l) and n. As for k of the KNN estimator, there does 
not seem to be any systematic dependence. Using the same threshold accuracy in 
estimating 1^, the smallest optimal h\ is always at a low level for all r(l) and n 
and there is no apparent pattern that would suggest a particular form of dependence 
of h\ on r(l) or n, as shown in Fig. [5]:. The sudden jumps in the graph of Fig.[5fc is 
due to numerical discrepancies around the chosen threshold for different hi values. 

In order to evaluate the criteria for selecting hi and h% in Tabled we computed 
for each criterion the score defined in Eq.© for all cases. The five optimal criteria 
and their scores for varying lengths of time series from white noise and linear 
systems are CI (0.85), C3 (0.94), C2 (0.96), C4 (1.57) and C9 (1.67). The simplest 
criteria turned out to score lowest with best being the "Gaussian" rule of Silverman 
CI (see also Eq.©). 

Summarizing the results on white noise and linear systems, it turns out that 
fixed binning estimators are the most dependent on the free parameter, the number 
of bins b, whereas for KNN and KE estimators a small number of neighbors k and 
bandwidth hi, respectively, turns out to be sufficient for all but very small time se- 
ries length n and weak autocorrelation r(r). In such cases, binning estimators can 
approximate 1^ better with a relatively small b and we provided an expression for 
this involving n and r(l). All estimators are consistent but converge at different 
rates to the true or asymptotic MI 1^, as shown in Fig.[6]for the AR(1) system with 
weak and strong autocorrelation. 



Figure 6 *** To be placed here *** 



In general, the KNN estimator converges fastest. To this respect, the parameter- 
free AD estimator would be the second best choice after the KNN estimator be- 
cause it showed a slower convergence rate. KE estimator has about the same 
convergence rate as AD and is not significantly affected by parameter selection 
(for /12 = hi), however it would not be preferred due to its computational cost. 
The estimation accuracy of each estimator is quantified by the index dI Copt = 

Yli=i(Ic op t(l) ~ ^oo(O) 2 ' where L = 126 and c op t is the optimal free parame- 
ter found in the simulations above, i.e. Hll for b in ED and H9 for b in EP, k = 2 
for KNN, and CI for the bandwidths in KE. The smallest index dI Copt = 0.254 
was obtained by ED, followed by KE (0.302) and KNN (0.496), whereas AD and 
EP scored worse (1.865 and 2.231, respectively). Our numerical analysis on the 
linear systems and noise showed that for the three aspects of estimation consid- 
ered, i.e. parameter dependence, rate of convergence, and accuracy of estimation, 
no estimator ranks first but KNN and KE turn out to perform overall best. 
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4.2 Results on nonlinear maps 

In terms of chaotic systems, let us first note that MI can be viewed as a measure 
on the reconstructed attractor projected on 5ft 2 , i.e. on points [xt,xt- T ]'- Due to 
the fractal structure in all scales of the chaotic attractor, X(r) defined in terms 
of a partition (see Eq.©) increases with finer partition towards the limit of X(r) 
given in Eq.dlJ for the continuous space. On the other hand, for the estimation of 
entropy, and particularly the Kolmogorov-Sinai or metric entropy, it is postulated 
that there exists a so-called generation partition that gives the expected entropy 
value and further refinement to this partition does n ot increase further the computed 
entropy [Walter], 1 19751 : ICohen & ProcacciaL 1985 1. However, with regard to the 



Shannon entropy, we observed that we can only get an upper limit of MI from the 
KNN estimator with k = 2 as the estimation algorithm does not allow for a finer 
partition, whereas increasing b for the binning estimators or decreasing h\ for the 
KE estimator within the tested range does not seem to lead to convergence of MI. 

The true X(t) in Eq.(Q]) is not known since the joint distribution of [x t ,xt- T ]' 
is also not known. This prevents the direct comparison of the estimators and the 
search for optimal free parameters. The presence of noise in chaotic time series 
sets a limit to the scale where fractal details can be observed and consequently to 
the finiteness of the partition when estimating X(r). In that case, an asymptotic 
loo does exist and the performance of the estimators in terms of the free parameter 
and time series length can be compared, also for different noise levels. In the 
following, we try to delineate the differences among estimators in estimating loo 
and particularly in converging to 1^ with respect to the time series length and their 
free parameter. 

The discussion above would suggest that the estimated MI should always in- 
crease as the partition gets finer, but in practice this requires a sufficient time series 
length n. For the binning estimators the optimal number of bins b, i.e. the b giv- 
ing largest MI and minimum |/oo(t) — X c (t) | , is not always the largest (limited to 
b = 64 in our study) but increases with n, as shown in Fig. for the ED estimator 
and the noise-free Henon map. 

Figure 7 *** To be placed here *** 

In the same figure the limits for optimal b from the suggested criterion Hll in 
Eq.© (lower for r(l) = and upper for r(l) = 1) are shown with dotted lines and 
are well beyond the optimal bins found for small lags. For this system, I(r) de- 
creases smoothly with r and therefore the optimal b decreases as well. For r = 10, 
I{t) levels off for small n and then b ~ 2 is optimal, but as n increases more bins 
give indeed larger values of MI. Thus as n increases weak MI for large r becomes 
significant and can be distinguished from the plateau of independence only when a 
larger b is used for the binning estimator. However, for a fixed b MI converges to 
loo with n, even for noise-free data (Fig. Et). 

With the addition of noise, J(r) decreases and the optimal number of bins 
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drops, as shown in Fig.[7J) and d respectively for 20% additive noise on the Henon 
time series. The stronger the noise component is, the more the deterministic struc- 
ture is masked and the faster the estimated MI levels towards zero with the lag. For 
the noisy chaotic data, the pattern of the dependence of the binning estimates of 
MI to n and b is closer to the one observed for the linear systems. For example, 
the range of optimal b in Fig. |7t> is at the level of b given by the suggested criterion 
Hll for r(l) ranging from to 1. The results on EP estimator are similar. 

In line with the ED and EP estimators, the AD estimator does not converge 
with n to Iqo for the nonlinear systems unless the fine partition is limited by the 
presence of noise, as shown in Fig.[8]for the Henon map. 

Figure 8 *** To be placed here *** 

The increase of n directs the algorithm of AD to make a finer partition which 
results in a larger I(t). The effect of n on the adaptive estimator decreases with 
the increase of the noise level. 

As pointed earlier, there is a loose relationship between the number of nearest 
neighbors k in the KNN estimator and the number of bins b in the binning esti- 
mators, i.e. small k corresponds to large b. The lower limit k = 1 corresponds 
to the finest partition for the given data, and the analogue b could be formidably 
large and is not reached in our study as b goes up to 64 (the same stands for b up to 
256). Thus direct comparison to binning estimators when k is very small cannot be 
drawn. For noise-free chaotic time series, v e ry fi ne partitions are sought and this 



agrees with the suggestion in iKraskov et all 120041] to use sma l l k at the order of 



3, which was also used in other simulation studies IIKreuz et al .. 2007; Khan et al. 



2007D. In Fig. [9^, we show for the noise-free Henon map that 1(f) increases with 



decreasing k. 

Figure 9 *** To be placed here 



For small n, a large value of k gives a poor estimation of the densities and conse- 
quently of I(t). For a fixed k, MI increases with n (see Fig. [9b)- Assuming a fixed 
k the effect of n on the KNN estimator is large similarly to the effect of n on the 
AD estimator as there in no convergence of MI with n, contrary to the fixed-bin 
estimators. In agreement to the binning estimators, the MI from the KNN estimator 
decreases with the noise level. Therefore the dependence of KNN estimator on n 
is smaller and If) converges faster to 1^ with n (Fig.[9t). Further, for larger n 
the estimation is the same regardless of the value of k. 

The dependence of the KE estimator on the bandwidth h\ is similar to the 
dependence of the KNN estimator on k. As shown in Fig.PTOk and b for the noise- 
free Henon map, If) increases with the decrease of h\ ranging from 0.01 to 2. 



Figure 10 *** To be placed here 
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We note that such extremely large values of 7(r) for very small bandwidth h\ do 
not occur by any other estimator. Given that the KNN estimator for k = 1 sets an 
upper limit for the estimated I(t) on the given time series, larger /(r) obtained 
by the KE estimator are superficially overflown estimates due to the use of an 
unsuitably small h\ for the given time series. This systematic bias for very small 
h\ is more pronounced with the addition of noise as it persists at the same level 
for larger r (see Fig.flOb). For the noisy data, /(r) decreases and differences with 
respect to the partitioning parameters are smaller, a feature we observed also with 
the other estimators (see Fig. PTOb and d). Also, the estimated I(t) is rather stable 
to the change of n. 

Regarding the 9 criteria for selecting h\ (and at cases /12, see Table©, the esti- 
mated bandwidths vary with the criterion but within a small range, e.g. for n = 256 
they are bounded in [0.13, 0.35] except C3 that always gives larger bandwidths and 
in this case hi ~ 0.7). Deviations of the estimated bandwidths hold for larger n 
but at smaller magnitudes, e.g. for n = 8192, they are bounded in [0.03, 0.18] and 
for C3 hi ~ 0.37). All criteria depend on n in a similar way and estimate smaller 
bandwidths as n increases giving larger 7(r) (see Fig. [Tib for n = 8192). 



Figure 1 1 *** To be placed here *** 



When noise is added to the time series, the estimated I(t) using different band- 
width selection criteria converge and are rather stable to the change of n (see 
Fig. Eh). 

Contrary to linear systems, for noise-free nonlinear maps, the estimated MI 
does not converge to an asymptotic 1^ and even for very large time series the MI 
values computed by different estimators vary, as we tested for n = 10 7 . For in- 
creasing n, a finer partition gives larger MI regardless of the selected estimator. 
The closest approximation to the finest partition for a large n is succeeded by the 
KNN estimator using a very small k, say k = 2 for n = 10 7 . This turned out to be 
indeed an upper bound of the estimated MI for large n. For the other estimators, 
restrictions to the partition resolution, i.e. smallest hi for KE and largest b for the 
binning estimators, bound the estimated MI to smaller values. For example, bins 
up to b = 256 for ED and EP estimators underestimate MI for n = 10 7 , meaning 
that b has to increase towards computationally prohibitive large magnitudes to suc- 
ceed an adequately fine partition for this data size. In the same way, the bandwidth 
hi has to decrease accordingly with n and for large n the KE estimator turns out to 
be computationally ineffective. The presence of noise sets a lower limit to the par- 
tition resolution and allows for an asymptotic MI value 1^ to which all estimators 
converge with n for suitably fine partition. 

The results on the different estimators were only given for the Henon map in 
order to facilitate comparisons, but qualitatively similar results are obtained from 
the same simulations on the Ikeda map. 
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4.3 Results on nonlinear flows 

When using MI on nonlinear flows the interest is often in extracting the lag tq of 
the first minimum of MI. We examine the estimate of tq with the different MI es- 
timators on the Mackey-Glass system for delays A = 17, 30 and 100 that regard 
increa sing complexity of correlation d imension being roughly 2,3 and 7, respec- 
tively BGrassberger & ProcacciaLll983ll . 

The simulations using the ED and EP estimators showed that the same to is 
estimated for all b, all re and noise levels, and for A = 17 and A = 30. For 
A = 100 we estimated the lag for which MI levels off and there was some variation 
in the selection of tq (see Fig. IT2l). 



Figure 12 *** To be placed here *** 



Although I(t) increases with b, To does not vary with b. Moreover, the estimate of 
To is stable with n and the addition of noise. 

AD estimator is also not affected by n, when computing the lag of the mini- 
mum MI tq in the Mackey-Glass system (see Fig.[T3b). 



Figure 13 *** To be placed here *** 

Addition of noise does not affect the mean To, as shown in Fig. [T3b . From sim- 
ulations on the Mackey-Glass system with A = 100 we observe that the mean 
estimated lag that MI levels off, holds for increasing n (see Fig. [T3c ) and addition 
of noise does not affect it. 

The estimation of to using the KNN estimator on the Mackey Glass systems 
varies more with n and k than for the binning estimators, as shown in Fig. [I4h and 
b. 



Figure 14 *** To be placed here *** 



With the addition of noise, the variance of the estimated to decreases with respect 
to k, and the mean is rounded to the same integer for all k. For the Mackey Glass 
system with A = 100 we observed that there is consistency with k and re, with MI 
for all k having the same shape and therefore giving the same lag for levelling off 
(see Fig. [Hfc). 

The mean estimated To using the KE estimator on realizations of each of the 
three Mackey-Glass systems is stable against changes in the time series length and 
bandwidth as for the binning estimators. 

Our simulations showed that all estimators identify sufficiently to as the shape 
of the MI function is not affected significantly by n or by the addition of noise. ED 
and KE estimators are the estimators of choice for this task, as they give smaller 
variation in the estimation of tq compared to the other estimators. 
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5 Discussion 



MI estimators are sensitive to their free parameter, with binning estimators (ED and 
EP) being the most affected. There is a loose correspondence among the different 
free parameters b, k and hi depending also on the time series length. Thus the 
differences in the performance of the estimators can be explained to some degree by 
the coarseness of the partition as determined by the free parameter. The choice of b 
for the binning estimators determines the bin size of the partition. The analogue of 
the bin size for the KNN estimator is the size of neighborhoods given by the number 
of neighbors k and for the KE estimator is the size of the efficient support of the 
kernel approximation given by the bandwidth hi (and hi). The simulation results 
have quantified the correspondence of the different free parameters and showed that 
for large time series, a suitable refined partition can be easily accommodated by a 
very small k or hi , whereas for the binning estimators the requirement for a very 
large b renders the binning estimator computationally ineffective. To this respect, 
the KNN estimator adapts easily to a refined partition by setting, say, k = 2, as 
does the adaptive binning estimator (AD) that has no free parameter, whereas h\ 
has to be further investigated at ranges of small values. 

The optimization of the parameters of the estimators is very crucial, even more 
than the choice of the estimator. Therefore, we focused on estimating the opti- 
mal free parameter of each estimator in order to fairly evaluate the estimators. For 
linear systems, we evaluated also different selection criteria for the optimal free 
parameter and based on the simulation study we proposed for the fixed-binning 
estimators the optimal b as a function of the autocorrelation and the time series 
length n. The parameter-free AD estimator tends to overestimate MI compared to 
the other estimators, indicating that the in-build partition algorithm of AD termi- 
nates at a very fine partition. The KNN estimator turns out to be the least sensitive 
to its free parameter. For example, k = 2 that gives a very fine partition does not 
deviate much for smaller time series where larger k are more appropriate. Our 
simulation results on the linear systems have shown that the KE estimator depends 
less than the binning estimators on the free parameter for the selected ranges of h\ 
and b, respectively. 

For noise-free nonlinear systems, all estimators lack consistency, i.e. the es- 
timated MI does not converge with n to an asymptotic value. Therefore, optimal 
parameter cannot be derived for these systems. The optimal parameter values found 
for the linear systems tend to give conservative estimates of MI for the nonlinear 
systems, for which a finer partition is required. This is accommodated by a small k 
in the KNN estimator. Indeed the simulation study on the different chaotic systems 
has shown that the KNN estimator has the least variance with the free parameter k 
than all other estimators. For noisy nonlinear systems, the MI from all estimators 
converge with n to an upper limit set by noise and KNN estimator for k = 2 turned 
out to reach this limit faster. 

For the computation of the lag To of the first minimum of MI, the binning 
estimators ED and EP as well as the KE estimator seem to perform best. For 
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the Mackey-Glass system, we observed that although I(t) may vary with the free 
parameter of the estimator and n, tq is rather stable. The addition of noise does not 
seem to effect the estimation of To. 

The KE estimator has the highest computational cost and the fixed-binning es- 
timators become computationally intractable when b has to be very large, as for 
long chaotic time series. On the other hand, the KNN estimator is rather fast for 
long time series that require small k for which neighbor search is faster. The com- 
putation efficiency of the AD estimator is comparable to that of KNN and these 
two estimators seem to be the most appropriate for all practical purposes in terms 
of computational efficiency, parameter selection (small k for KNN and no free pa- 
rameter for AD) and accuracy of estimation (with KNN scoring better than AD). 

We note that the consistency of estimators of MI on linear systems is not indica- 
tive of the behavior of the estimators on nonlinear systems. Although consistency 
of estimators is claimed in some recent works, this might be due to the use of only 
linear systems or noisy real data, such as EEC 
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Table 1 : The simulation systems and their parameters. The input white noise for 
the linear systems (rows 4 to 7) and the observational white noise for the nonlin- 
ear systems (rows 8 to 10) have zero mean and standard deviation one. Gamma 
noise is skewed with 7 = 0.5.The parameter notations are /j, for the mean, a for 
the standard deviation and 7 for the skewness coefficient, ip for the coefficient of 
the autoregressive part for AR(1) and ARMA(1,1) and 1? for the coefficient of the 
moving average part of ARMA(1,1), t s for the sampling time and 5 for the dis- 
cretization time for the Mackey-Glass system. The noise levels considered for the 
nonlinear systems are 20% and 40%. 



Systems 


Parameters 


Noise 


Gaussian white noise 
Gamma white noise 


H = 0, a = 1 

H = 0, a = 1, 7 = 0.5 




AR(1) 
AR(1) 


<f = 0.5,0.9,-0.5,-0.9 
ip = 0.5,0.9, -0.5,-0.9 


Gaussian 
Gamma 


ARMA(1,1) 
ARMA(1,1) 


(p = 0.9, 1? = 0.6 & <p = 0.7, 1? = 0.3 
p = 0.7, ^ = 0.3 & ip = 0.3,i? = 0.1 


Gaussian 
Gamma 


Henon 
Ikeda 

Mackey-Glass 


a = 1.4,6 = 0.3 

a = 1.0,6 = 0.9, k = 0.4, 77 = 6.0 
A = 17, 30, 100, t s = 17, S = 0.1 


Gaussian 
Gaussian 
Gaussian 
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Table 2: Criteria for the selection of the number of bins. The parameters in the 
expressions are the time series length n, the standard deviation s, the interquartile 
range IQR, the range of the data R and the standardized skewness 72 as defined in 
[Doane, Il97dl . The exact expressions for criteria H8 and H9 can be found in the 
in the corresponding references, given in the third column. 



Criteria 


Number of bins 


Reference 


HI 


1 + log 2 n 


rSturee. 19261 


H2 


1.87(n- l) u - 4 


[Bendat & Piersol, 1966] 


H3 


1 + log 2 n + log 2 72 


[Doane, 1976] 


H4 


s/n 


[Tukev & Mosteller, 1977] 


H5 


3.49.S 


rScott. 19791 


H6 


fin 1 / 3 
2(IQR) 


[Freedman & Diaconis. 19811 


H7 




rTerrell& Scott. j9851 


H8 


min. of stochastic complexity 


[Rissanen, 1992] 


H9 


mode of log of marginal posterior pdf 


[Knuth, 2006] 


H10 




[Cochran, 1954] 
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Table 3: Criteria for the selection of bandwidths for one (h\) and two (/12) dimen- 
sions. The parameters in the expressions are a = 1.8 — r(l) if n < 200 and 
a = 1.5 if n > 200, where r(l) is the autocorrelation at lag 1, R = l/2y / 7r, s 
is the standard deviation and IQR is the interquartile range of the data. The exact 
expressions for the last four criteria can be found in the corresponding references, 
given in the third column. 



Criteria 


hi 


h 2 


Reference 


CI 


(4/3n) 1 /5 


(l/n) x /6 


[Silverman, 1986] 


C2 


(4/3n) 1 /5 


(4/5™) 1 / 6 


[Silverman, 1986] 


C3 


IMan- 1 / 5 


an' 1 / 6 


[Harrold et al., 2001] 


C4 


( <b£5).A min(s ,i9R) 


h 


[Silverman, 1986] 
[Wand & Jones. 19951 


C5 




sJ1h x 


C6 


L-stage direct plug-in 


hi 


[Wand & Jones. 19951 


C7 


s/2hi 




C8 


Solve-the-equation plug-in 


hi 


rSheather & Jones. 19911 


C9 


s/2hi 
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Table 4: Ranking and score S of the 5 criteria for b scoring lowest for the ED 
and EP estimators for varying lengths of time series from white noise and linear 
systems. 



Criteria 


S for ED 


Criteria 


S for EP 


Hll 


0.25 


H9 


0.61 


H7 


0.94 


H7 


0.62 


H8 


1.16 


H8 


0.65 


H5 


1.20 


H10 


0.67 


H9 


1.41 


Hll 


0.72 
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Figures captions 



Figure 1: Mean 1(1) as a function of b from 1000 realizations of AR(1) with coef- 
ficient if = r(l) = 0.5 and additive Gaussian noise for (a) the ED and (b) the EP 
estimator and for data sizes as given in the legend. 

Figure 2: (a) Optimal number of bins for different n for AR(1) systems with 
lag one autocorrelation r (1) as in the legend, (b) Graph of the optimal b for a range 
of log 2 (n) and r(l). The results in both panels regard the ED estimator. 

Figure 3: Mean estimated MI with AD estimator as a function of n from 1000 
realizations of (a) normal white noise and (b) AR(1), with r = 0.5 and normal 
input white noise. 

Figure 4: (a) Mean estimated MI with the KNN estimator as a function of n 
from 1000 realizations of AR(1) with r(l) = 0.5 and normal input white noise for 
k as in the legend. The dotted line stands for 1^ . (b) Graph of the optimal k for a 
range of log 2 (n) and r(l). 

Figure 5: Mean estimated MI with the KE estimator as a function of hi from 
1000 realizations of AR(\) with r(l) = 0.5 and normal input white noise for (a) 
h\ = h2, and (b) h\ = \tlhi, and n as in the legend, (c) Graph of the optimal h\ 
for a range of log 2 (n) and r(l). 

Figure 6: (a) Mean estimated MI vs n for the estimators given in the legend 
from simulations on AR(1) with r(l) = 0.5 and normal input white noise. For 
each estimator the optimal free parameter is considered, i.e. Hll for ED, H9 for 
EP, k = 2 for KNN and CI for KE. (b) As (a) but for r(l) = 0.9. 

Figure 7: (a) Optimal number of bins b as a function of the time series length 
n for the ED estimator from 1000 realizations of the Henon map for different lags, 
as given in the legend, (b) Same graph as in (a) but for the Henon map with 20% 
additive noise. In both plots the dotted lines give the optimal number of bins b from 
the suggested criterion in © assuming r(l)=0 and r(l)=l, as given in the plot, (c) 
Mean estimated MI with ED estimator as a function of r from 1000 realizations of 
the Henon map for b = 32, and n as in the legend, (d) As (c) but for Henon map 
with 20% additive noise. 

Figure 8: Mean estimated MI with AD estimator as a function of r from 1000 
realizations of the Henon map with no noise in (a) and with 20% noise in (b). 

Figure 9: (a) Mean estimated MI with KNN estimator as a function of r from 
1000 realizations of the Henon map, for n = 256 and k as in the legend, (b) As in 
(a) but for k = 2 and n as in the legend, (c) As in (a) but for Henon map with 20% 
additive noise. 

Figure 10: (a) Mean estimated MI with KE estimator as a function of r from 
1000 realizations of the Henon map, for n = 512 and bandwidths as in the legend. 
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(b) As in (a) but for n = 4096 and bandwidths as in the legend, (c) and (d) are the 
same as (a) and (b) respectively but for the Henon map with 20% additive noise. 

Figure 11: (a) Mean estimated MI with KE estimator as a function of r from 
1000 realizations of the noise-free Henon map with n = 8192 and nine bandwidth 
selection criteria as given in the legend, (b) As (a) but for 20% additive noise. 

Figure 12: (a) Mean estimated To and standard deviation as error bar as a func- 
tion of b from all time series lengths using the ED estimator on the Mackey-Glass 
system with A = 17, 30, 100, as given in the legend, (b) As in (a) but for the EP 
estimator. 

Figure 13: (a) Mean estimated tq and standard deviation as error bar as a func- 
tion of n using the AD estimator on 1000 realizations of the Mackey-Glass system 
with A = 17, 30, as given in the legend, (b) As in (a) (without standard devi- 
ations) for additive noise with levels 20, 40%, as given in the legend, (c) Mean 
estimated MI with the AD estimator as a function of r from 1000 realizations of 
the Mackey-Glass with A = 100, for n as in the legend. 

Figure 14: Mean estimated To as a function of n using the KNN estimator on 
1000 realizations of the Mackey-Glass system with (a) A = 17, and (b) A = 
30. (c) Mean estimated MI with the KNN estimator as a function of r from 1000 
realizations of the Mackey-Glass with A = 100, for k as in the legend and n = 
2048. 
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